DACIDR: Deterministic Annealed Clustering with Interpolative Dimension Reduction using Large Collection of 16S rRNA Sequences

Authors

  • Yang Ruan
  • Saliya Ekanayake
  • Mina Rho
  • Haixu Tang
  • Seung-Hee Bae
  • Judy Qiu
  • Geoffrey Fox
Abstract

The development of next-generation sequencing technology has made it possible to generate millions of sequences from environmental samples. However, the difficulty of taxonomy-independent analysis increases as the number of sequences grows. Most existing algorithms that generate operational taxonomic units (OTUs) have quadratic space and time complexity, which makes them suitable only for small datasets. An alternative is to use heuristic methods; although they enable fast sequence analysis, their hard-cutoff similarity threshold and random starting seed can result in reduced accuracy and overestimation. In this paper, we propose DACIDR, a parallel sequence clustering and visualization pipeline that addresses the overestimation problem and the space and time complexity issues while giving robust results. The pipeline starts with a parallel pairwise sequence alignment analysis, followed by a deterministic annealing method of clustering and dimension reduction. No explicit similarity threshold is needed in the clustering process. Experiments with our system also showed that the quadratic time and space complexity issue could be solved with a novel heuristic method called Sample Sequence Partition Tree (SSP-Tree), which allowed us to interpolate millions of sequences with sub-quadratic time and linear space requirements. Furthermore, SSP-Tree can speed up fine-tuning of an existing result, which makes recursive clustering feasible for achieving accurate local results. Our experiments showed that DACIDR produced a more reliable result than two popular greedy heuristic clustering methods: UCLUST and CD-HIT.
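To illustrate why a deterministic annealing approach needs neither a random starting seed nor a hard similarity cutoff, here is a minimal sketch of the general technique on toy one-dimensional points. It is not the DACIDR implementation: the function name `da_cluster`, every parameter value, and the use of numeric points in place of aligned sequences are assumptions made purely for illustration.

```python
import math


def da_cluster(points, k=2, t_start=100.0, t_min=0.01, cooling=0.8, em_iters=20):
    """Toy deterministic-annealing clustering of 1-D points (illustrative only)."""
    mean = sum(points) / len(points)
    # All centers start at the data mean, so no random seed is involved.
    centers = [mean] * k
    temperature = t_start
    while temperature > t_min:
        # Tiny fixed offsets let coincident centers separate once the
        # temperature falls low enough for a split to be favorable.
        centers = [c + i * 1e-3 for i, c in enumerate(centers)]
        for _ in range(em_iters):
            weighted_sum = [0.0] * k
            weight_total = [0.0] * k
            for x in points:
                sq = [(x - c) ** 2 for c in centers]
                best = min(sq)
                # Soft (Gibbs) assignment; subtracting `best` keeps the
                # largest weight at exactly 1.0, so `norm` is never zero.
                w = [math.exp(-(d - best) / temperature) for d in sq]
                norm = sum(w)
                for j in range(k):
                    p = w[j] / norm
                    weighted_sum[j] += p * x
                    weight_total[j] += p
            # Move each center to the weighted mean of the points it attracts.
            centers = [
                weighted_sum[j] / weight_total[j] if weight_total[j] > 0 else centers[j]
                for j in range(k)
            ]
        temperature *= cooling  # cool gradually: soft assignments harden
    return sorted(centers)


points = [0.0, 0.2, 0.4, 9.6, 9.8, 10.0]
centers = da_cluster(points)
```

On this toy input the two centers should settle close to the means of the two visible groups, and repeated runs return identical values because nothing in the procedure is randomized. Cluster membership emerges from the gradual cooling schedule rather than from a fixed similarity threshold.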


Related resources

Phylogeny of urate oxidase producing bacteria: on the basis of gene sequences of 16S rRNA and uricase protein

Uricase or urate oxidase (urate:oxygen oxidoreductase, EC 1.7.3.3), a peroxisomal enzyme found in many bacteria, catalyzes the oxidative opening of the purine ring of urate to yield allantoin, carbon dioxide, and hydrogen peroxide. In this study, the phylogeny of urate oxidase (uricase) producing bacteria was studied based on gene sequences of 16S rRNA and uricase protein. Repres...


Molecular Detection of Novel Genetic Variants Associated to Anaplasma ovis among Dromedary Camels in Iran

To the best of our knowledge, little information is available regarding the presence of Anaplasma species in camels in Iran. This study sought to investigate the presence of Anaplasma species by microscopy and polymerase chain reaction (PCR) assays in 100 healthy dromedaries (Camelus dromedarius) arriving for slaughter. The microscopic examination of Giemsa-stained blood films revealed that Ana...


Defining Reference Sequences for Nocardia Species by Similarity and Clustering Analyses of 16S rRNA Gene Sequence Data

BACKGROUND The intra- and inter-species genetic diversity of bacteria and the absence of 'reference', or the most representative, sequences of individual species present a significant challenge for sequence-based identification. The aims of this study were to determine the utility, and compare the performance of several clustering and classification algorithms to identify the species of 364 seq...


Phylogenetic clustering of soil microbial communities by 16S rRNA but not 16S rRNA genes.

We evaluated phylogenetic clustering of bacterial and archaeal communities from redox-dynamic subtropical forest soils that were defined by 16S rRNA and rRNA gene sequences. We observed significant clustering for the RNA-based communities but not the DNA-based communities, as well as increasing clustering over time of the highly active taxa detected by only rRNA.


Genetic variations of avian Pasteurella multocida as demonstrated by 16S-23S rRNA gene sequences comparison

Pasteurella multocida is known as an important heterogenic bacterial agent that causes severe diseases such as fowl cholera in poultry and haemorrhagic septicaemia in cattle and buffalo. A polymerase chain reaction (PCR) assay was developed using primers derived from a conserved part of the 16S-23S rRNA gene. The PCR amplified a fragment of 0.7 kb using DNA from nine avian P. multocida isolates...



Publication date: 2012